High-performance OCR preclassification trees
Abstract
We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers must prune the set of alternative classes to a small number without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. We describe a refinement of the method of [BM94] that approaches the user-specified accuracy more closely and thus allows higher pruning. The essential technical device is a leaf-selection rule based on the Good-Turing Theorem [Good53]. Such a preclassifier, constructed for a pan-European polyfont classifier, attains a 1% error rate and a 3.8 pruning factor, in tests on synthetic images. On real pages printed in ten European languages, the preclassifier sped up the page reader by a factor of 2.2, with no measurable increase in error.
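The abstract names the Good-Turing Theorem as the basis of the leaf-selection rule but does not spell the rule out. As a minimal sketch of the underlying estimate only (not the authors' rule), the Good-Turing estimate of the probability that the next sample belongs to a class not yet observed is N1/N, where N1 is the number of classes seen exactly once and N is the total number of samples. The function name and the sample data below are hypothetical, chosen for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(labels):
    """Good-Turing estimate of the probability that the next sample
    carries a class label not yet observed in `labels`: N1 / N.

    Illustrative helper (hypothetical name); not the leaf-selection
    rule of the paper, which the abstract does not describe.
    """
    n = len(labels)
    if n == 0:
        # No samples seen yet: every class is still unseen.
        return 1.0
    counts = Counter(labels)
    # N1: number of distinct classes observed exactly once.
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


# Hypothetical class labels of training samples that reached one leaf.
leaf_samples = ["e", "e", "e", "c", "c", "o", "e", "c", "a", "e"]
print(good_turing_unseen_mass(leaf_samples))  # 2 singletons / 10 samples
```

One plausible use, stated here as an assumption, is to keep populating a leaf with samples until this estimate falls below the user-specified error budget, since a high value suggests the leaf's class list is still incomplete.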
Similar resources
An OCR System for Printed Documents
This paper describes the general structure of a full automated document analysis system for printed documents. The system is based on a character preclassification stage which reduces the number of patterns to recognize and introduces a new contextual processing. This specific approach for multifont printed documents reading is based on pattern character redundancies. With the study of prototyp...
Bounded-Error Preclassification Trees
We discuss an automatic method for constructing high-performance preclassification trees. The role of preclassifiers is to prune the number of classes to a fraction of the total (by contrast, classifiers pick exactly one class). Decision trees make fast preclassifiers; if they also prune strongly, they can increase speed of the overall system significantly when used with class...
Comparative Study of Human Age Estimation with or without Preclassification of Gender and Facial Expression
Age estimation has many useful applications, such as age-based face classification, finding lost children, surveillance monitoring, and face recognition invariant to age progression. Among many factors affecting age estimation accuracy, gender and facial expression can have negative effects. In our research, the effects of gender and facial expression on age estimation using support vector regr...
IDA: A System for Automated Sorting, Indexing, and Classification of Documents
IDA (Intelligent Document Analysis) is a modular software system that helps automate paper document entry. IDA consists of the following components: layout analysis, preclassification, OCR interface, fuzzy string matching, text categorization, and lexical, syntactical and semantic analysis. The system has been applied to a variety of tasks: Presorting of forms, reports and letters, index ext...
Boosting Decision Trees
A new boosting algorithm of Freund and Schapire is used to improve the performance of decision trees which are constructed using the information ratio criterion of Quinlan’s C4.5 algorithm. This boosting algorithm iteratively constructs a series of decision trees, each decision tree being trained and pruned on examples that have been filtered by previously trained trees. Examples that have been...